Computer-Using Agent
https://openai.com/index/computer-using-agent/
Powering Operator is Computer-Using Agent (CUA), a model that combines GPT‑4o's vision capabilities with advanced reasoning through reinforcement learning.
CUA is trained to interact with graphical user interfaces (GUIs)—the buttons, menus, and text fields people see on a screen—just as humans do.
It operates the GUI the way a human does (conversely, it does not use APIs)
👉schroneko/systemprompts chatgpt_operator_2025-02-22.txt
While CUA is still early and has limitations, it sets new state-of-the-art benchmark results, achieving a 38.1% success rate on OSWorld for full computer use tasks, and 58.1% on WebArena and 87% on WebVoyager for web-based tasks.
How it works
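Input to CUA
The task is given as text
A screenshot is also provided
CUA generates an action (as text, as I understand it)
The action is applied to the virtual machine, which yields the next screenshot
That screenshot is fed back in together with the task (text)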
https://images.ctfassets.net/kftzwdyauwt9/66EMZgoHtZBCDjjTZiWodY/e0472c662472755fa576522ce12a457d/Infographic_Transparent__Mobile_.png?w=1200&q=70&fm=webp
CUA processes raw pixel data to understand what’s happening on the screen and uses a virtual mouse and keyboard to complete actions.
It has a virtual mouse and keyboard
Given a user’s instruction, CUA operates through an iterative loop that integrates perception, reasoning, and action.
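The perception, reasoning, and action loop can be sketched roughly as below. This is a toy illustration under stated assumptions, not OpenAI's implementation: `Action`, `ToyScreen`, `toy_policy`, and `run_agent` are hypothetical names, the "screenshot" is a string rather than raw pixels, and the policy is a hard-coded rule standing in for the model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """Hypothetical action record; the real action format is not given in the source."""
    kind: str          # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


class ToyScreen:
    """Stand-in for the virtual machine: holds UI state and renders a 'screenshot'."""

    def __init__(self) -> None:
        self.focused = False
        self.typed = ""

    def screenshot(self) -> str:
        # A real agent would receive pixels; a string keeps the sketch runnable.
        return f"focused={self.focused};typed={self.typed}"

    def apply(self, action: Action) -> None:
        # Virtual mouse: a click focuses the text field.
        if action.kind == "click":
            self.focused = True
        # Virtual keyboard: typing only lands if the field is focused.
        elif action.kind == "type" and self.focused:
            self.typed += action.text


def toy_policy(task: str, screenshot: str) -> Action:
    """Hard-coded stand-in for the model: maps (task, screenshot) to the next action."""
    if "focused=False" in screenshot:
        return Action(kind="click", x=100, y=40)
    if f"typed={task}" not in screenshot:
        return Action(kind="type", text=task)
    return Action(kind="done")


def run_agent(
    task: str,
    env: ToyScreen,
    policy: Callable[[str, str], Action],
    max_steps: int = 10,
) -> List[Action]:
    """Iterate: take a screenshot, choose an action, apply it, repeat until done."""
    history: List[Action] = []
    for _ in range(max_steps):
        shot = env.screenshot()        # perception
        action = policy(task, shot)    # reasoning (task text + current screenshot)
        history.append(action)
        if action.kind == "done":
            break
        env.apply(action)              # action; the next loop sees the new screenshot
    return history


env = ToyScreen()
history = run_agent("hello", env, toy_policy)
print([a.kind for a in history])
```

The `max_steps` cap is a design choice of this sketch: it keeps the loop from running forever if the policy never emits "done".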
Evaluations
Benchmarks for browser operation
Benchmarks for computer operation
Not read yet (on the to-read pile)